1 Introduction

Building upon the contribution of Brunori and Neidhoefer (2020), the goal of our project is to estimate inequality of opportunity (IOP) using Machine Learning (ML) techniques. The measurement of unequal opportunities goes back to Roemer (1998), who states that the factors determining individual outcomes (e.g. income) fall into two categories: effort and circumstances. Effort summarizes all aspects over which individuals have control, while circumstances include all factors individuals cannot control. Individuals who share the same exogenous circumstances, and are therefore characterized by similar background conditions, belong to a circumstance type (Roemer, 1998). In what follows we analyze between-type disparities.

In the previous literature, several methods have been applied to measure IOP, including parametric, non-parametric and latent variable approaches (QUELLE). The reason we rely on ML methods is that, as outlined by Brunori and Neidhoefer (2020), empirical results are sensitive to the model selection decisions made by the researcher. As widely discussed in the literature, model selection, and thus partial observability of the true set of exogenous variables, causes a downward bias in opportunity estimates (Ferreira and Gignoux, 2011). ML methods make it possible to circumvent problems such as sample size limitations or non-fixed and non-additive effects of circumstances, and thus to balance this bias (Bourguignon et al., 2007; Ferreira and Gignoux, 2011). More precisely, in our empirical analysis we use conditional inference trees and conditional inference forests, which belong to the classification and regression tree (CART) methods popularized by Breiman et al. (1984). CART methods lower the risk of arbitrary model selection by using an automated algorithm that splits the predictor space into non-overlapping regions to establish the best model for predicting the outcome variable. In the case of equality of opportunity, the algorithm partitions the sample of respondents into different types. Moreover, the conditional inference algorithm employs sequences of hypothesis tests, which restrains model overfitting and thus trades off upward and downward biases (Hothorn et al., 2006). Another advantage of CART methods is their intuitive graphical representation, which makes them easily accessible to a large audience.

The project is organized as follows: First, after assessing our data, we use the Austrian income and living conditions data for 2019 to estimate conditional inference trees and a conditional inference forest for Austria. Second, we apply the same data wrangling and analysis procedure to the synthetic EU-SILC 2011 data set, where we estimate the trees and forest for 6??? countries. In the last part, we compare the results across countries and conclude.

Libraries

library(bibtex)         # citations
library(knitcitations); cleanbib()
cite_options(citation_format = "pandoc", check.entries=FALSE)

library(tidyverse)      
library(readr)          # import
library(rpart)          # regression trees
library(rpart.plot)     # regression tree plots
library(summarytools)   # summary statistics
library(party)          # ctree
library(partykit)       # ctree
library(caret)
library(forecast)
library(ineq)           # Gini
library(precrec)        # ROC curves
library(corrplot)       # Correlation plots
library(plotly)         # interactive ggplot2 plots :D

2 Austria: EU-SILC 2019

2.1 Data import, wrangling, exploration and visualization

2.1.1 Data import

In the first part of our data analysis we rely upon data from Statistics Austria, namely the survey data on income and living conditions for Austria in the year 2019. In addition, the data set includes an ad hoc module containing many of the intergenerational transmission variables needed to properly assess the inequality of opportunity of the respondents.

# setting the data path
data_path ="./AT2019"
# accessing the data
data19 <- read.csv(file.path(data_path, "p_silc2019_ext.csv"), sep = ";")

2.1.2 Data Wrangling

The first step of our data wrangling is to select the variables of interest. These are based upon the list of circumstances chosen in (Brunori, Hufe, and Mahler 2018). The circumstances include both characteristics describing the respondent and circumstances related to the respondent's intergenerational background. Personal characteristics are sex and country of birth. Intergenerational circumstances include the presence of parents at home, the number of adults (aged 18 and over) at home, the number of working adults, and the number of children (under 18) at home, all measured when the respondent was aged 14. Further intergenerational circumstances are the level of education of the respondent's parents, their occupational status, their main occupation, whether they held a managerial position, their citizenship, and the tenancy status.

Since annual income is not properly provided in the Statistics Austria data set for 2019, we use gross monthly income as our dependent variable and approximate yearly income by multiplying the monthly income by 12. Moreover, we only retain values larger than zero in our income variable, since values below zero indicate people who were not able to provide information or refused to answer the survey questions concerning income. Furthermore, since income is commonly assumed to be right-skewed due to a few disproportionately high income earners, we take the log of the income variable.

After selecting the described variables from the original data set, we rename them and save them to our main data set “data19”. Building on this data set we further restrict the sample by age: we only include respondents aged between 27 and 59, since this captures the working population. In the next step we drop all observations where the respondents refused or were not able to provide information on intergenerational circumstances such as the father's or mother's citizenship. We do not do this for all variables, since it would leave us with too few observations (e.g. dropping adults). Finally, we recode some of the variables from characters into factors or numeric variables in order to later estimate the conditional inference trees.

data19 <- data19 %>%
  
  select(sex, P038004, P110000nu, P111010nu, alter, M009010, M010000, M014000, M016000, M017000, M020010, M021000, M025000, M027000, M028000, M004000, M001300, M001510, M003100, M001100, M001200, M002000, M001500) %>% 
  
  rename("inc_net"           = P038004,     # gross monthly income
         "country_birth"     = P110000nu,   # country of birth of respondent
         "citizenship"       = P111010nu,   # citizenship of respondent
         "age"               = alter,       # age of respondent
         "father_cit"        = M009010,     # citizenship of father at age 14
         "father_edu"        = M010000,     # education of father at age 14 (höchster abschluss)
         "father_occup_stat" = M014000,     # occupational status of father at age 14
         "father_occup"      = M016000,     # main occupation of father at age 14
         "father_manag"      = M017000,     # managerial position of father at age 14            
         "mother_cit"        = M020010,     # citizenship of mother at age 14
         "mother_edu"        = M021000,     # education of mother at age 14
         "mother_occup_stat" = M025000,     # occupational status of mother at age 14
         "mother_occup"      = M027000,     # main occupation of mother at age 14
         "mother_manag"      = M028000,     # managerial position of mother at age 14
         "tenancy"           = M004000,     # tenancy at age 14
         "children"          = M001300,     # number of children (under 18) in respondent’s household at age 14
         "adults"            = M001510,     # number of adults (aged 18 or more) in respondent’s household
         "adults_working"    = M003100,     # number of working adults (aged 18 or more) in respondent’s hhd.
         "father_present"    = M001100,     # father present in respondent’s household at age 14
         "mother_present"    = M001200,     # mother present in respondent’s household at age 14
         "adults_present"    = M001500,     # adults present in respondent’s household at age 14
         ) %>%    
  
  filter(age %in% (27:59), inc_net > 0, mother_present > 0, father_present > 0, father_cit > 0, mother_cit > 0)  %>%
  # DAN: still need to check that observations are not dropped entirely when a single variable is NA
  # DAN: also still checking whether we can use annual income here instead
  # M002000
  
  mutate( "inc_net_log" = log(inc_net*12),   
         # logged net income per month of respondent
         
         "both_parents_present" = father_present + mother_present,            
         # 4 = none present, 3 = one present, 2 = both present
         
         sex = factor(ifelse(as.numeric(sex)==2, 1, 0)), 
         # 0 = male, 1 = female
         
         country_birth = factor(country_birth, labels = c(1, 2, 2, 2, 3, 3)), 
         # Austria = 1, EU = 2, Non-EU = 3
         
         father_cit = ifelse(father_cit == 1, 1, 2),                          
         # Austria = 1 and Other = 2
         
         mother_cit = ifelse(mother_cit == 1, 1, 2))   
         # Austria = 1 and Other = 2

2.1.3 Data Summary

print(dfSummary(data19), method="render", style="grid", plain.ascii = F)

Data Frame Summary

[dfSummary output for data19 (summarytools 0.9.8, R version 4.0.3, 2021-02-17): 3600 observations of 25 variables, 1 duplicate row. For each variable the rendered table reports its type, summary statistics or factor frequencies, a distribution graph, and the counts of valid and missing values; only father_occup (2 missing) and mother_occup (1 missing) contain missings. Negative codes such as -5, -3 and -2 mark refused or inapplicable survey answers.]

After cleaning our data we end up with a data set that includes 25 variables and 3,600 observations. There are almost no missing values left, since we corrected for refused survey answers or dropped the values where participants were not able to answer the survey questions. There are slightly more females (52%) than males (48%) in the data. The income distribution is skewed to the right, meaning the median income is lower than the mean income (we take the logarithm to correct for this). 84% of the respondents were born in Austria and 88% are Austrian citizens. 84% of the respondents' fathers and also 84% of the respondents' mothers held Austrian citizenship. 82% of the fathers, but only 54% of the mothers, were employed when the respondents were at age 14. 38% of the employed fathers and 9% of the employed mothers held a managerial position. About 72% of the respondents lived in housing their family did not own at age 14, and 83% lived together with other children. The father was present in 91% and the mother in 98% of cases; both were present in 90% of cases. 20% of the respondents lived with adults other than their parents.

2.1.4 Gini Index and Lorenz curve

In order to get a first glimpse of how high inequality is in Austria overall, we calculate and visualize the Gini coefficient.

ineq(data19$inc_net, type = "Gini")
## [1] 0.2543448

The Gini index is 0.25, which is a bit lower than the World Bank estimate for Austria of 0.3 (2017), available at https://data.worldbank.org/indicator/SI.POV.GINI?locations=AT.
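As a sanity check on the ineq() result, the Gini coefficient can also be computed directly from its definition; the sketch below uses the rank-weighted closed form (gini_manual is our own helper name, not part of any package).

```r
# Gini index from its closed form: G = 2*sum(i*y_(i)) / (n*sum(y)) - (n+1)/n,
# where y_(i) are the incomes sorted in ascending order.
gini_manual <- function(y) {
  y <- sort(y[y > 0])     # keep positive incomes only, as in our data wrangling
  n <- length(y)
  (2 * sum(seq_len(n) * y)) / (n * sum(y)) - (n + 1) / n
}

gini_manual(rep(1000, 10))   # perfectly equal incomes -> 0
gini_manual(c(1, 3))         # mean absolute difference 1, mean 2 -> 0.25
```

Applied to data19$inc_net, this should reproduce the value reported by ineq() above up to the package's normalization.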

plot(Lc(data19$inc_net), col = "darkred", lwd = 3)

The Gini index corresponds to the area below the black line of equal distribution and above the red line of the actual distribution.

2.1.5 Age pyramid

agepyra <- ggplot(data19, aes(x = age, fill=sex))  + 
  geom_bar(data = subset(data19, sex==1)) +
  geom_bar(data = subset(data19, sex==0), aes(y=..count..*(-1))) + 
  scale_x_continuous(breaks = seq(27,59,2), labels=abs(seq(27,59,2))) +
  scale_fill_manual(name = "Sex", labels = c("Male", "Female"), values=c("springgreen2", "slateblue1")) +
  labs(title = "Age pyramide of ad-hoc module on intergenerational transmission of disadvantages", x = "Age", y = "Number of people") +
  theme_bw() +
  coord_flip()

ggplotly(agepyra)

In the Data Frame Summary above we already saw that there are slightly more females (1) than males (0) in the data set and that the median age is 44. While the age distribution of the sample is fairly even, there are slightly more older than younger individuals.

median(data19$age)
## [1] 44

2.1.6 Correlation plot

data19cor <- data19
data19cor$sex <- as.numeric(data19cor$sex)
data19cor$country_birth <- as.numeric(data19cor$country_birth)

# Dropping the categorical variables father_occup & mother_occup
data19cor <- select(data19cor, -c(father_occup, mother_occup))

# Computing significance tests on the raw data first (cor.mtest expects
# observations, not a correlation matrix), then the correlation coefficients
res1 <- cor.mtest(data19cor, conf.level = 0.99)
data19cor <- cor(data19cor)

corrplot(data19cor, method = "ellipse", type = "upper", order = "FPC", diag = FALSE, outline = FALSE, tl.cex = .5, tl.col = "black", title = "Correlation plot", p.mat = res1$p, sig.level = 0.01, insig = "blank", mar=c(2,2,2,2))

As can be seen from the correlation plot, every variable is significantly related to at least one other variable in the data set (at the 1% significance level). For better visibility, insignificant correlations are blanked out. As the correlation matrix is ordered by its first principal component, the significant correlations cluster together.

2.2 Method: Conditional Inference Trees and Forests

To estimate equality of opportunity we let an automated algorithm decide how to partition the population into mutually exclusive types, in order to obtain a measure of inequality of opportunity. We follow the procedure described by (Brunori, Hufe, and Mahler 2018). We show our results using both classification and regression trees and conditional inference trees, putting more emphasis on the latter. Conditional inference trees and conditional inference forests are a technique developed and described by (???). We break down their main characteristics for our purposes:

The essential R functions we use are:

  - ctree from the party package in R: recursive partitioning just like rpart
  - rpart: splits by maximizing an information measure
  - ctree: splits via a significance test procedure
  - caret: additional cross-validation on top of ctree_control

Advantages of Trees: Next to being rather straightforward to interpret, using such an algorithm minimizes the degree of randomness and arbitrariness in model selection. Trees show outcome variability without assuming upfront which circumstances play a significant role in shaping individual opportunities or how they interact (Brunori, Hufe, and Mahler 2018).

Advantages of Trees over linear regression models: a very large set of observations can be used, and the model specification is no longer exogenously given.

Advantages of Conditional Inference Trees over Regression and Classification Trees (CART): the algorithm automatically provides a test for the null hypothesis of equality of opportunity and prevents overfitting, whereas CART “cannot distinguish between a significant and an insignificant improvement in the information measure” (Mingers 1987, as cited in Torsten Hothorn (n.d.), 2), and it takes the distributional properties of the measures into account. Since the algorithm avoids upward and downward biases, the estimates obtained are better suited for comparisons across time (i.e. Austria 2011 to Austria 2019) and across countries (EU-SILC), even when sample sizes differ (Brunori, Hufe, and Mahler 2018).

Procedure

Empirical approach as described in (Brunori, Hufe, and Mahler 2018, 4): > We consider a population for each country of size \(N\), indexed by \(i \in \{1, ..., N \}\), and a vector of incomes \(Y=\{y_1,...,y_i,...,y_N \}\). Our assumption is that each individual i’s income is the result of two sets of factors. First, a set of circumstances, which are beyond her control and for which we have observations, of size \(P: \Omega_i =\{ C^1_i, ..., C^p_i, ..., C^P_i\}\). Then, there is a set of efforts, which we do not observe, of size \(Q: \Theta_i = \{E^1_i, ..., E^q_i, ..., E^Q_i \}\). This results in a very general outcome generating function \(g: \Omega \times \Theta \rightarrow \mathbb{R}_+\), or \[y_i = g(\Omega_i, \Theta_i).\] Each circumstance \(C^p \in \Omega\) has a total of \(X^p\) realizations, each denoted \(x^p\). The conditional inference trees partition the population into a set of non-overlapping types, whereby a type is a subgroup of the original population that is homogeneous in terms of circumstances. We have types \(T=\{t_1, ..., t_m,...,t_M \}\), and individuals i and j belong to the same type \(t_m \in T\) if \(x^p_i = x^p_j ~ \forall C^p \in \Omega\). Likewise, they belong to different types if \(\exists C^p \in \Omega : x^p_i \ne x^p_j\). Types define a particular way of partitioning the population into subgroups, and group membership indicates uniformity in circumstances. In essence this means that the approach we utilize here takes an ex-ante view that focuses on between-type differences in the value of opportunity sets, without paying attention to the effort realizations of individuals. The tree-based method obtains predictions for our outcome y as a function of the input variables I, our observed circumstances. The method uses this set of variables to partition the population into a set of non-overlapping groups, \(G= \{g_1,...,g_m,...,g_M \}\), where each group is homogeneous in each input variable.
Graphically these groups are identified as terminal nodes or leaves. The tree also gives us the predicted outcome value per observation. This means that in addition to the observed income vector \(Y=\{y_1,...,y_i,...,y_N \}\) we also obtain a vector of predicted values \(\hat{Y}=\{\hat{y}_1,...,\hat{y}_i,...,\hat{y}_N \}\), where \[\hat{y}_i = \mu_m = \frac {1}{N_m} \sum_{i \in g_m} y_i, \quad \forall i \in g_m, \forall g_m \in G.\] The interpretation of the regression trees is then that, conditional on the input variables being circumstances only (\(I \subseteq \hat{\Omega} \subseteq \Omega\)), each resulting group \(g_m \in G\) can be interpreted as a circumstance type \(t_m \in T\). Importantly, the predicted vector \(\hat{Y}\) is analogous to the smoothed distribution of \(Y\) and is our prediction of equal incomes within a group.
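To make the ex-ante logic concrete, the following sketch (toy data and our own variable names, not part of the estimation below) builds the smoothed distribution by replacing each income with its type mean; the Gini index of that smoothed vector is then a between-type, i.e. ex-ante, inequality measure.

```r
library(ineq)

# Toy example: six individuals in two circumstance types
y    <- c(900, 1100, 1000, 1800, 2200, 2000)
type <- factor(c("a", "a", "a", "b", "b", "b"))

# Smoothed distribution: every individual is assigned their type mean
y_hat <- ave(y, type)                        # 1000 for type a, 2000 for type b

iop_abs <- ineq(y_hat, type = "Gini")        # inequality between types only
iop_rel <- iop_abs / ineq(y, type = "Gini")  # share of total inequality explained
```

In the empirical part, the role of `type` is played by the terminal node a respondent falls into, and `y_hat` by the tree's fitted values.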

The algorithm of conditional inference trees follows a step-wise procedure of permutation tests as described by (Brunori, Hufe, and Mahler 2018, 7–8):

  1. Choose a confidence level \(\alpha\) and test the null hypothesis of independence, \(H_0^{C^p} : D(Y|C^p) = D(Y)\), for each input variable \(C^p \in \hat{\Omega}\), obtaining a p-value for each test, \(p^{C^p}\). \(\implies\) We adjust the p-values for multiple hypothesis testing, such that \(p_{adj.}^{C^p} = 1-(1-p^{C^p})^P\), which is a Bonferroni-type correction.
  2. Choose the splitting feature: having tested all the null hypotheses of independence between the individual outcome and each observable circumstance, the model selects the variable \(C^*\) with the lowest adjusted p-value, i.e. \(C^* = \{C^p : \text{argmin} ~ p_{adj.}^{C^p} \}\).
     - If no hypothesis can be rejected, i.e. \(\min p_{adj.}^{C^p} > \alpha\): exit the algorithm.
     - If one or more circumstances are significant, i.e. \(p_{adj.}^{C^*} \leq \alpha\): continue and select \(C^*\) as the splitting variable.
  3. Choose the split point: for every possible way the selected circumstance can divide the sample into two subgroups, test the hypothesis of equal mean outcomes in the two resulting subgroups and choose the splitting point with the smallest p-value. Technically, for each possible binary partition s based on \(C^*\), with \(Y_s = \{Y_i : C^*_i < x^p \}\) and \(Y_{-s} = \{Y_i : C^*_i \geq x^p \}\), we obtain a p-value \(p^{C^*_s}\) and split the sample at the point with the lowest p-value, \(C^*_s = \{C^*_s : \text{argmin} ~ p^{C^*_s} \}\).
  4. Repeat steps 1–3 on each resulting subsample until no further split is significant.

In the context of estimating inequality of opportunity, conditional inference trees offer a particular interpretation: each hypothesis test checks whether equality of opportunity holds within a group. If the tree results in no splits, we cannot reject the null hypothesis of equality of opportunity, while the deeper the tree is grown, the more types are required to account for inequality of opportunity in the country under consideration. Each split (parent node) thus indicates that the opportunities of the two resulting groups are significantly different, while we cannot say the same for the groups merged within a leaf node.
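The variable-selection step (steps 1 and 2 above) can be sketched as follows. This is an illustration only: it uses a plain ANOVA F-test as a stand-in for the permutation test statistic that ctree actually employs, and select_split_variable is our own helper name.

```r
# Sketch: test independence of outcome and each circumstance, adjust the
# p-values as p_adj = 1 - (1 - p)^P (the Bonferroni-type correction above),
# and stop unless the smallest adjusted p-value is below alpha.
select_split_variable <- function(y, X, alpha = 0.05) {
  P <- ncol(X)
  p_raw <- sapply(X, function(cp) anova(lm(y ~ factor(cp)))[["Pr(>F)"]][1])
  p_adj <- 1 - (1 - p_raw)^P
  if (min(p_adj) > alpha) return(NULL)  # cannot reject H0: equality of opportunity
  names(which.min(p_adj))               # circumstance with the lowest adjusted p-value
}

set.seed(123)
X <- data.frame(sex = rep(0:1, each = 50), noise = sample(0:1, 100, replace = TRUE))
y <- rnorm(100, mean = 10 + 1.5 * X$sex)  # outcome depends strongly on sex only
select_split_variable(y, X)               # should select "sex"
```

A NULL return corresponds to the algorithm's exit condition, i.e. a tree with no splits.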

2.3 Empirical Analysis

2.3.1 Regression Tree

To showcase the difference between the regression and classification trees we discussed in class and the conditional inference trees, we also plot the former for comparison. In the following chunk of code we use set.seed to make the random draws reproducible. We define our formula, which consists of all the circumstances we use for estimation. Furthermore, we split the data into a training and a test sample. Finally, we define fitControl, our tuning function for cross-validation using the caret package.

set.seed(123)

formula = inc_net_log ~ sex + country_birth + father_cit + father_edu + father_occup_stat + father_occup + father_manag + mother_cit + mother_edu + mother_occup_stat + mother_occup + mother_manag + tenancy + children + adults + adults_working + both_parents_present

data19 <- data19 %>%
  mutate(train_index = sample(c("train", "test"), nrow(data19), replace=TRUE, prob=c(0.80, 0.20)))

train <- data19 %>% filter(train_index=="train")
test <- data19 %>% filter(train_index=="test")
fitControl <- trainControl(method = "repeatedcv", number = 10, repeats = 10, savePredictions = T)

tuning_grid <- expand.grid(cp = seq(0, 0.02, by= 0.005))
tuning_grid
##      cp
## 1 0.000
## 2 0.005
## 3 0.010
## 4 0.015
## 5 0.020
caret_rpart <- train(formula, data = data19, method = "rpart", trControl = fitControl, tuneGrid = tuning_grid, metric = "RMSE", na.action = na.pass)

caret_rpart
## CART 
## 
## 3600 samples
##   17 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 10 times) 
## Summary of sample sizes: 3240, 3240, 3240, 3239, 3240, 3240, ... 
## Resampling results across tuning parameters:
## 
##   cp     RMSE       Rsquared   MAE      
##   0.000  0.5170694  0.1177017  0.3801252
##   0.005  0.4795502  0.1778021  0.3465238
##   0.010  0.4834562  0.1642072  0.3498513
##   0.015  0.4851941  0.1581771  0.3514803
##   0.020  0.4851522  0.1583089  0.3514609
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was cp = 0.005.
tree_caret_final <- caret_rpart$finalModel
rpart.plot(tree_caret_final, box.palette="RdBu", nn=FALSE, type=2)

The caret_rpart model for Austria shows a tree with five partitions and seven terminal nodes. The splitting variables already indicate which circumstances are most significant in determining the income of the respondents. In Austria, sex is the most important determinant of income. However, we show this tree only as a comparison to the conditional inference trees.

2.3.2 Conditional inference tree

The conditional inference tree algorithm as implemented in the party and partykit packages offers various settings for variable selection and stopping criteria. We use the default ctree_control function, but find it necessary to explain what it does and why we think our chosen specifications are not distorting. We use the default test statistic since we know neither the conditional expectation nor the covariance of our circumstances; in such a case the default setting ctree_control(teststat = "quad") is recommended (???). In the Austria 2019 data we have mostly cleaned out missing values, but we still have many NA entries further on in the document. For reasons of uniformity, we chose the test type ctree_control(testtype = "Bonferroni"), which uses simple Bonferroni-adjusted p-values. In addition to the default control function, we also use the caret package for cross-validation, since it is the one we discussed in class and using it keeps the results comparable.
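For transparency, the defaults we rely on can be written out explicitly. The sketch below simply spells out the documented default values of ctree_control rather than changing anything:

```r
library(partykit)

# Default control settings of ctree, made explicit
ctrl <- ctree_control(
  teststat     = "quad",        # quadratic form of the test statistic
  testtype     = "Bonferroni",  # Bonferroni-adjusted p-values
  mincriterion = 0.95           # split only if 1 - p >= 0.95, i.e. p <= 0.05
)
```

Passing control = ctrl to ctree is therefore equivalent to the call with control = ctree_control(testtype = "Bonferroni") below; mincriterion is the parameter that caret later tunes.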

# For the Inference Tree to work, we must have all variables as numeric data

Ctree <- ctree(formula, data = train, control = ctree_control(testtype = "Bonferroni")) 
Ctree
## 
## Model formula:
## inc_net_log ~ sex + country_birth + father_cit + father_edu + 
##     father_occup_stat + father_occup + father_manag + mother_cit + 
##     mother_edu + mother_occup_stat + mother_occup + mother_manag + 
##     tenancy + children + adults + adults_working + both_parents_present
## 
## Fitted party:
## [1] root
## |   [2] sex in 0
## |   |   [3] country_birth in 1, 2
## |   |   |   [4] mother_cit <= 1
## |   |   |   |   [5] father_edu <= 2
## |   |   |   |   |   [6] mother_edu <= -2: 9.773 (n = 9, err = 6.5)
## |   |   |   |   |   [7] mother_edu > -2: 10.181 (n = 777, err = 100.1)
## |   |   |   |   [8] father_edu > 2: 10.265 (n = 365, err = 73.9)
## |   |   |   [9] mother_cit > 1
## |   |   |   |   [10] father_edu <= 6
## |   |   |   |   |   [11] adults_working <= 0: 9.577 (n = 7, err = 1.7)
## |   |   |   |   |   [12] adults_working > 0: 10.025 (n = 116, err = 9.9)
## |   |   |   |   [13] father_edu > 6: 10.238 (n = 46, err = 12.8)
## |   |   [14] country_birth in 3: 9.797 (n = 71, err = 26.7)
## |   [15] sex in 1
## |   |   [16] mother_edu <= 2
## |   |   |   [17] father_cit <= 1: 9.743 (n = 897, err = 239.5)
## |   |   |   [18] father_cit > 1: 9.573 (n = 191, err = 58.9)
## |   |   [19] mother_edu > 2: 9.860 (n = 398, err = 132.1)
## 
## Number of inner nodes:     9
## Number of terminal nodes: 10
plot(Ctree, type = "simple",gp = gpar(fontsize = 6),
  inner_panel=node_inner,
  ip_args=list(id = FALSE), main = "Conditional Inference Tree for Austria 2019") # TODO: change heading font size

We obtain a deep tree with 9 inner nodes and 10 terminal nodes. Sex is the most significant determinant of income. The further determinants differ considerably between men (0) and women (1).

### data = changed from train to data19!
caret_ctree <- train(formula, data = data19, method = "ctree", trControl = fitControl, na.action = na.pass)
caret_ctree
## Conditional Inference Tree 
## 
## 3600 samples
##   17 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 10 times) 
## Summary of sample sizes: 3240, 3240, 3240, 3240, 3239, 3240, ... 
## Resampling results across tuning parameters:
## 
##   mincriterion  RMSE       Rsquared   MAE      
##   0.01          0.4926421  0.1504879  0.3568914
##   0.50          0.4793687  0.1800009  0.3448702
##   0.99          0.4781751  0.1826241  0.3442583
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was mincriterion = 0.99.
caret_ctree_B <- ctree(formula, data = data19, control = ctree_control(testtype = "Bonferroni", mincriterion = 0.99)) 
caret_ctree_B
## 
## Model formula:
## inc_net_log ~ sex + country_birth + father_cit + father_edu + 
##     father_occup_stat + father_occup + father_manag + mother_cit + 
##     mother_edu + mother_occup_stat + mother_occup + mother_manag + 
##     tenancy + children + adults + adults_working + both_parents_present
## 
## Fitted party:
## [1] root
## |   [2] sex in 0
## |   |   [3] country_birth in 1, 2
## |   |   |   [4] mother_cit <= 1
## |   |   |   |   [5] mother_occup_stat <= -2: 9.791 (n = 15, err = 7.1)
## |   |   |   |   [6] mother_occup_stat > -2: 10.210 (n = 1416, err = 209.5)
## |   |   |   [7] mother_cit > 1
## |   |   |   |   [8] father_edu <= 6: 10.003 (n = 152, err = 21.4)
## |   |   |   |   [9] father_edu > 6: 10.305 (n = 54, err = 15.7)
## |   |   [10] country_birth in 3: 9.811 (n = 92, err = 32.6)
## |   [11] sex in 1
## |   |   [12] father_edu <= 2
## |   |   |   [13] father_cit <= 1: 9.737 (n = 1055, err = 290.7)
## |   |   |   [14] father_cit > 1: 9.521 (n = 204, err = 64.0)
## |   |   [15] father_edu > 2: 9.857 (n = 612, err = 169.4)
## 
## Number of inner nodes:    7
## Number of terminal nodes: 8
plot(caret_ctree_B,gp = gpar(fontsize = 6),
  inner_panel=node_inner,
  ip_args=list(abbreviate = FALSE,id = FALSE), main = "Opportunity Conditional Inference Tree for Austria 2019 - Cross Validated")

plot(caret_ctree,gp = gpar(fontsize = 6),
  inner_panel=node_inner,
  ip_args=list(id = FALSE))

plot(caret_ctree$finalModel, type = "simple")

We plot the conditional inference tree using both the caret and the party package to compare the results. In the first we used the suggested mincriterion = 0.99 obtained through cross-validation. We now have a tree with 8 terminal nodes and 7 inner nodes. Sex remains the most significant determinant. Furthermore, we see, for example, that the country of birth matters, with the split distinguishing births inside the EU from those outside it. The boxplots at the terminal nodes show the distribution of the outcome variable at each node. All splits are chosen at a significance level of p < 0.001, which is in line with the disciplinary convention for hypothesis tests (Brunori, Hufe, and Mahler 2018, 9).

2.3.3 Graphic representation of the tuning parameters

The graph below shows how the p-value threshold is tuned, with the RMSE as the selection criterion. Since the lowest RMSE is achieved using the strictest p-value threshold, that is the one we choose.

plot(caret_ctree) # RMSE vs p-value our resampling parameter

plot(caret_rpart)

# plotcp(tree_1)
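The mincriterion values compared above come from caret's default grid. A custom grid of p-value thresholds could instead be supplied through the tuneGrid argument of train(); the following is a sketch under that assumption (the object caret_ctree_grid is hypothetical, and the formula, data19, and fitControl objects are those defined earlier):

```r
# Custom tuning grid: mincriterion = 1 - p-value threshold for ctree splits.
grid <- expand.grid(mincriterion = c(0.50, 0.90, 0.95, 0.99))

# Hypothetical call, shown commented out; it mirrors the train() call above
# but replaces caret's default grid with our own.
# caret_ctree_grid <- train(formula, data = data19, method = "ctree",
#                           trControl = fitControl, tuneGrid = grid,
#                           na.action = na.pass)
```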

2.3.4 Predictions

To test the predictive power of the models, we calculate the root mean squared error (RMSE) on the test data. The cross-validated conditional inference tree performs best, i.e. it achieves the smallest RMSE.

test$P_AtCt <- predict(Ctree, newdata = as.data.frame(test))
test$perror <- (test$P_AtCt - test$inc_net_log)^2
test$RMSE <- sqrt(sum((test$P_AtCt - test$inc_net_log)^2/nrow(test), na.rm = T))
head(test$RMSE)
## [1] 0.459181 0.459181 0.459181 0.459181 0.459181 0.459181
test$P_AtCt_caret <- predict(caret_rpart, newdata = as.data.frame(test))
test$perror_caret <- (test$P_AtCt_caret - test$inc_net_log)^2
test$RMSE_caret <- sqrt(sum((test$P_AtCt_caret - test$inc_net_log)^2/nrow(test), na.rm = T))
head(test$RMSE_caret)
## [1] 0.4566333 0.4566333 0.4566333 0.4566333 0.4566333 0.4566333
test$P_AtCt_caret <- predict(caret_ctree_B, newdata = as.data.frame(test))
test$perror_caret <- (test$P_AtCt_caret - test$inc_net_log)^2
test$RMSE_caret <- sqrt(sum((test$P_AtCt_caret - test$inc_net_log)^2/nrow(test), na.rm = T))
head(test$RMSE_caret)
## [1] 0.4536642 0.4536642 0.4536642 0.4536642 0.4536642 0.4536642
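The three nearly identical computations above can be collapsed into a small helper. The function name rmse is ours, not from any package; it reproduces the formula used above, summing the squared errors (dropping NAs) and dividing by the total number of observations:

```r
# Helper reproducing the RMSE formula used above: squared errors are summed
# with missing values dropped, divided by the number of observations.
rmse <- function(predicted, actual) {
  sqrt(sum((predicted - actual)^2 / length(actual), na.rm = TRUE))
}

# Usage sketch (assumes the fitted models and the `test` data from above):
# rmse(predict(caret_ctree_B, newdata = test), test$inc_net_log)
```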

2.3.5 Conditional Inference Forest

The procedure and application of conditional inference forests follows (Brunori, Hufe, and Mahler 2018, 10). As discussed, conditional inference trees construct the counterfactual distribution of the income variable as their outcome. However, they use only limited information from the set of observed circumstances, since not all circumstances \(C^p \in \hat{\Omega}\) are utilized, and their predictions (the values of the opportunity sets) have high variance. Conditional inference forests address both shortcomings. The main difference between the forest and the tree approach is that in the forest each tree is estimated on a random subsample \(b\) of the original data, so that in total \(B\) trees are estimated. Furthermore, a random subset of circumstances is considered at each splitting point, which guarantees that eventually every circumstance with any informational value is used as a splitting variable. Averaging the result over the \(B\) predictions reduces the variance. The individual predictions are a function of \(\alpha\), the significance level governing splits, \(\bar{P}\), the number of circumstances to be considered at each split, and \(\bar{B}\), the number of subsamples.
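These three tuning parameters map onto arguments of partykit's cforest(). The wrapper below is a hypothetical sketch of ours, not code used in the analysis; its argument names alpha, P, and B mirror \(\alpha\), \(\bar{P}\), and \(\bar{B}\):

```r
# Hypothetical wrapper mapping the tuning parameters onto partykit::cforest():
#   alpha -> mincriterion = 1 - alpha (significance level governing splits)
#   P     -> mtry  (number of circumstances sampled at each split)
#   B     -> ntree (number of trees grown on random subsamples)
cf_sketch <- function(formula, data, alpha = 0.01, P = 5, B = 500) {
  partykit::cforest(formula, data = data,
                    control = partykit::ctree_control(mincriterion = 1 - alpha),
                    mtry = P, ntree = B)
}
```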

cf <- cforest(formula, data19, na.action = na.pass, control = ctree_control(teststat = "quadratic", testtype = "Bonferroni", mincriterion = 0.99), ytrafo = NULL, scores = NULL, ntree = 500L, perturb = list(replace = FALSE, fraction = 0.8))

data19$hat_cf <- predict(cf, newdata = as.data.frame(data19), OOB = TRUE, type = "response")
data19$RMSE <- sqrt(sum((data19$hat_cf - data19$inc_net_log)^2/nrow(data19), na.rm = T))
head(data19$RMSE)
## [1] 0.4813855 0.4813855 0.4813855 0.4813855 0.4813855 0.4813855
varimp(cf, mincriterion = 0, OOB = TRUE) 
##                  sex        country_birth           father_cit 
##         0.0678499187         0.0059050641         0.0056436377 
##           father_edu         father_occup           mother_cit 
##         0.0053886058         0.0028044436         0.0065437766 
##           mother_edu    mother_occup_stat         mother_occup 
##         0.0041194799         0.0009257607         0.0004644832 
##         mother_manag              tenancy             children 
##         0.0017892779        -0.0001982098         0.0003635829 
##               adults       adults_working both_parents_present 
##        -0.0006739003         0.0003050525         0.0005388467
importance_cf <- data.frame(varimp(cf, mincriterion = 0, OOB = TRUE))
names(importance_cf) <- "importance"
importance_cf$var_name = rownames(importance_cf)
importance_cf <- importance_cf  %>% arrange( desc(importance))

We obtain an RMSE for the conditional inference forest of 0.48. Furthermore, we obtain a table of variable importances as identified by the forest, which we arrange in descending order and plot below:

varimpo <- importance_cf %>% ggplot(aes(x = var_name, y = importance)) +
    geom_pointrange(shape = 21, colour = "black", fill = "white", size = 3, stroke = 1, aes(ymin = 0, ymax = importance)) +
    scale_x_discrete(limits = importance_cf$var_name[order(importance_cf$importance)]) +
    labs(title = "Conditional Forest variable importance - Austria 2019", x = "", y = "Mean decrease in sum of squared residuals") +
    coord_flip() +
    theme_light() +
    theme(axis.line = element_blank(), panel.border = element_blank(), panel.grid.major.y=element_blank())

ggplotly(varimpo)

Both the conditional inference tree and the conditional inference forest analysis indicate that sex is the most important determinant of an individual's income in Austria. We are surprised by this outcome, since (Brunori, Hufe, and Mahler 2018) find that for Austria the country of birth of the father is the variable determining the initial split of the tree. In our results, however, sex is markedly more important than any other variable under consideration; and while it is arguably not a generationally transmissible characteristic, it is genetic, circumstantial, and completely outside the respondent's control.

2.3.6 Boosted Inference Tree

#cf_boosted <- blackboost(formula, data = data19, na.action = na.pass, control = boost_control(), tree_controls = partykit::ctree_control())
# cf_boosted

cf_boosted_train <- train(formula, data19, method = "ctree2", trControl = fitControl, tuneGrid = NULL, na.action = na.pass)

#RMSE

test$At_BT_CF_pred <- predict(cf_boosted_train, newdata = as.data.frame(test))
test$perror <- (test$At_BT_CF_pred - test$inc_net_log)^2
test$RMSE <- sqrt(sum((test$At_BT_CF_pred - test$inc_net_log)^2/nrow(test), na.rm = T))
head(test$RMSE)
## [1] 0.4564731 0.4564731 0.4564731 0.4564731 0.4564731 0.4564731
plot(test$At_BT_CF_pred, test$inc_net_log) # TODO: redo as a ggplot and polish

### Variable Importance

### partykit's varimp() does not work on the boosted tree fitted with the caret package:

varimp(cf_boosted_train, mincriterion = 0, OOB = TRUE) 
## Error in UseMethod("varimp"): no applicable method for 'varimp' applied to an object of class "c('train', 'train.formula')"

# The remaining steps depend on the object above and therefore fail as well,
# so we leave them commented out. caret's generic varImp() might offer an
# alternative, but we did not pursue this here.
# importance_cf_boosted <- data.frame(varimp(cf_boosted_train, mincriterion = 0, OOB = TRUE))
# names(importance_cf_boosted) <- "importance"
# importance_cf_boosted$var_name <- rownames(importance_cf_boosted)
# importance_cf_boosted <- importance_cf_boosted %>% arrange(desc(importance))

3 Cross Country Comparison: EU-SILC 2011

In this part of the seminar paper we attempt to reproduce the findings of (Brunori, Hufe, and Mahler 2018), but unfortunately we do not have access to the actual EU-SILC data from 2011. Instead, we reproduce the analysis using the synthetic data provided by the statistical office of the European Union (Eurostat) (https://ec.europa.eu/eurostat/web/microdata/statistics-on-income-and-living-conditions).

3.1 Data Wrangling

The original data is not provided, as the EU protects the privacy of the original respondents. The idea of the public microdata is that it allows us to train models and write code using the actual variable names, without obtaining true results. The EU-SILC public microdata files are fully synthetic: they were simulated using statistical modeling and reproduce the statistical distributions of the original data. The main caveat is that the data cannot be used for statistical inference about the wider population, so the results and conclusions obtained from this public microdata are to be taken with a grain of salt. Fortunately, the individual country datasets are grouped in a coherent manner. We use the EU-SILC data from 2011, since that survey wave contained, in addition to the usual questions, an ad-hoc module on intergenerational transmission, i.e. questions about the parents of the respondents. Unfortunately, however, the data contains various implausible errors, which make it difficult to apply ctree to it. While the synthetic data is distributed similarly to the actual data, its missing values are not in any way systematic but random. This makes cleaning the data practically impossible, since otherwise we would lose almost all of our observations. The datasets for Finland, Denmark, and Spain contain more missing values than answers for most of the ad-hoc questions, and the Italian household dataset, as provided on the Eurostat website, contains only 138 observations, which makes merging it with the personal dataset of about 20,000 observations pointless.
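The extent of the missingness can be checked before modeling. The helper below is a minimal sketch of ours (the name missing_share is hypothetical, not from the paper); df stands in for any of the country data frames loaded below:

```r
# Share of missing values per column of a data frame, largest first.
# `df` stands in for any of the country data frames (e.g. AT_equality_data).
missing_share <- function(df) {
  sort(colMeans(is.na(df)), decreasing = TRUE)
}
```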

The unique identifier used in all four datasets is the household ID: RX030 in the Personal Register, PX030 in the Personal Data, DB030 in the Household Register, and HB030 in the Household Data file. We only need to combine two of the datasets, namely the Household Register and the Personal Data; the latter contains the ad-hoc module with the questions on intergenerational characteristics.

Following (Brunori, Hufe, and Mahler 2018) we use the following variables as circumstances: respondent's sex (PB150), respondent's country of birth (citizenship as proxy, PB220A), presence of parents at home (PT010), number of adults (18 or older) in the respondent's household (PT020), number of children in the respondent's household (PT030), father's/mother's country of birth and citizenship (PT060, PT070, PT090, PT100), father's/mother's education (PT110, PT120), father's/mother's occupational status (PT130, PT160), father's/mother's main occupation (PT150, PT180), managerial position of father/mother (PT140, PT170), and tenancy status of the house in which the respondent was living as a child (PT210).

Outcome variables, i.e. income: Total household gross income (HY010), Total disposable income (HY020), Dwelling type (HH010), Housing (HH030).

We first select more variables than are ultimately used in the analysis. We use the year of birth to calculate age and then exclude everyone aged 60 or older or younger than 27, as was done in the paper we refer to. We initially included both monthly and annual gross income, but in this cross-country analysis we use annual gross income as our outcome variable.

At first we ran the analysis with the citizenship variable included, but we ultimately decided that it is not really a circumstance variable in the way the respondent's country of birth would have been, since it is ultimately possible to obtain a new citizenship.

# setting the data path
data_path ="./SILC_2011"

# accessing the data
AT_personal_data <- read.csv(file.path(data_path, "AT_2011p_EUSILC.csv"))
AT_household_data <- read.csv(file.path(data_path, "AT_2011h_EUSILC.csv"))

# change the name of the identifier variable
AT_household_data <- AT_household_data %>% rename("PX030" = HB030)

# joining the data
AT_equality_data <- AT_personal_data %>%  left_join(AT_household_data, by = "PX030")

# Renaming important variables for readability of tree
AT_equality_data <- AT_equality_data %>% select(
  PB140, HY010, PB150, PB220A, PT010, PT020, PT030, PT060, PT070, PT090, PT100, PT110, PT120, PT130, PT160, PT150, PT180, PT140, PT170, PT210, PY200G) %>% mutate(
    age = (2011 - PB140), log_income = log(HY010 + 1) # We add 1 so the logarithm is defined for zero incomes
  ) %>% filter(
    age %in% (27:59)
  ) %>% mutate(
    citizenship = factor(PB220A, labels = c(1,2,3)) # Citizenship of the respondent
  ) %>%
  rename(
    "year_of_birth" = PB140,        # Respondents year of birth
    "annual_income" = HY010,
    "sex" = PB150,                  # Respondents sex
    "parents_present" = PT010,      # Presence of parents (or those considered as such)
    "adults_home" = PT020,          # Number of adults living in the household when the respondent was 14 years old
    "children_home" = PT030,        # Number of children in the household
    "father_cob" = PT060,           # Country of birth of father
    "father_cit" = PT070,           # Citizenship of father
    "mother_cob" = PT090,           # Country of birth of mother
    "mother_cit" = PT100,           # Citizenship of mother
    "father_edu" = PT110,           # Highest level of education attained by father
    "mother_edu" = PT120,           # Highest level of education attained by mother
    "father_occup_stat" = PT130,    # Activity status of father
    "mother_occup_stat" = PT160,    # Activity status of mother
    "father_occup" = PT150,         # Main occupation (job) of father 
    "mother_occup" = PT180,         # Main occupation (job) of mother
    "father_manag" = PT140,         # Managerial position of father
    "mother_manag" = PT170,         # Managerial position of mother
    "tenancy" = PT210,              # Tenancy status when respondent was 14 years old
    "monthly_income" = PY200G)      # Gross monthly earnings for employees

Summary We provide the summary statistics for Austria, which we obtained using ‘dfSummary’ from the package ‘summarytools’. Similar to the 2019 dataset, ‘AT_equality_data’ contains almost 7,000 observations and no missing entries in our outcome variable, annual income. However, it does contain many missing values across the observed circumstances. We chose not to exclude those observations and instead handle the missing entries with the na.action = na.pass argument when running the statistical analysis.

print(dfSummary(AT_equality_data), method="render", style="grid", plain.ascii = F)

Data Frame Summary

AT_equality_data

Dimensions: 6741 x 24
Duplicates: 0
No Variable Stats / Values Freqs (% of Valid) Graph Valid Missing
1 year_of_birth [integer] Mean (sd) : 1966.9 (9.2) min < med < max: 1952 < 1966 < 1984 IQR (CV) : 16 (0) 33 distinct values 6741 (100.0%) 0 (0.0%)
2 annual_income [integer] Mean (sd) : 72909.7 (51328.2) min < med < max: 0 < 62595 < 653401 IQR (CV) : 57961 (0.7) 4077 distinct values 6741 (100.0%) 0 (0.0%)
3 sex [integer] Min : 1 Mean : 1.5 Max : 2
1:3320(49.3%)
2:3421(50.7%)
6741 (100.0%) 0 (0.0%)
4 PB220A [character] 1. AT 2. EU 3. Other
5781(85.8%)
352(5.2%)
608(9.0%)
6741 (100.0%) 0 (0.0%)
5 parents_present [integer] Mean (sd) : 1.3 (0.8) min < med < max: 1 < 1 < 5 IQR (CV) : 0 (0.6)
1:3586(86.5%)
2:75(1.8%)
3:385(9.3%)
4:68(1.6%)
5:31(0.7%)
4145 (61.5%) 2596 (38.5%)
6 adults_home [integer] Mean (sd) : 2.7 (1.3) min < med < max: 0 < 2 < 12 IQR (CV) : 1 (0.5) 13 distinct values 4176 (61.9%) 2565 (38.1%)
7 children_home [integer] Mean (sd) : 2.5 (1.6) min < med < max: 1 < 2 < 16 IQR (CV) : 2 (0.6) 14 distinct values 4164 (61.8%) 2577 (38.2%)
8 father_cob [integer] Mean (sd) : 1.3 (0.7) min < med < max: -1 < 1 < 3 IQR (CV) : 0 (0.5)
-1:10(0.2%)
1:3182(78.2%)
2:335(8.2%)
3:542(13.3%)
4069 (60.4%) 2672 (39.6%)
9 father_cit [integer] Mean (sd) : 1.3 (0.7) min < med < max: -1 < 1 < 3 IQR (CV) : 0 (0.5)
-1:8(0.2%)
1:3337(81.1%)
2:224(5.4%)
3:545(13.2%)
4114 (61.0%) 2627 (39.0%)
10 mother_cob [integer] Mean (sd) : 1.4 (0.7) min < med < max: -1 < 1 < 3 IQR (CV) : 0 (0.5)
-1:1(0.0%)
1:3212(77.0%)
2:395(9.5%)
3:564(13.5%)
4172 (61.9%) 2569 (38.1%)
11 mother_cit [integer] Mean (sd) : 1.3 (0.7) min < med < max: -1 < 1 < 3 IQR (CV) : 0 (0.5)
-1:5(0.1%)
1:3323(80.0%)
2:297(7.1%)
3:529(12.7%)
4154 (61.6%) 2587 (38.4%)
12 father_edu [integer] Mean (sd) : 1.7 (0.8) min < med < max: -1 < 2 < 3 IQR (CV) : 1 (0.5)
-1:129(3.1%)
0:24(0.6%)
1:1535(36.8%)
2:1856(44.5%)
3:623(15.0%)
4167 (61.8%) 2574 (38.2%)
13 mother_edu [integer] Mean (sd) : 1.4 (0.7) min < med < max: -1 < 1 < 3 IQR (CV) : 1 (0.5)
-1:60(1.4%)
0:89(2.2%)
1:2271(54.9%)
2:1532(37.0%)
3:187(4.5%)
4139 (61.4%) 2602 (38.6%)
14 father_occup_stat [integer] Mean (sd) : 1.3 (0.7) min < med < max: -1 < 1 < 6 IQR (CV) : 0 (0.5)
-1:31(0.8%)
1:2943(74.8%)
2:869(22.1%)
3:16(0.4%)
4:51(1.3%)
5:3(0.1%)
6:19(0.5%)
3932 (58.3%) 2809 (41.7%)
15 mother_occup_stat [integer] Mean (sd) : 3 (1.9) min < med < max: -1 < 2 < 6 IQR (CV) : 4 (0.6)
-1:18(0.4%)
1:1567(38.3%)
2:665(16.3%)
3:6(0.1%)
4:16(0.4%)
5:1795(43.9%)
6:25(0.6%)
4092 (60.7%) 2649 (39.3%)
16 father_occup [integer] Mean (sd) : 5.6 (2.3) min < med < max: -1 < 6 < 9 IQR (CV) : 3 (0.4) 11 distinct values 3813 (56.6%) 2928 (43.4%)
17 mother_occup [integer] Mean (sd) : 5.5 (2.1) min < med < max: -1 < 5 < 9 IQR (CV) : 1 (0.4) 11 distinct values 2272 (33.7%) 4469 (66.3%)
18 father_manag [integer] Mean (sd) : 1.6 (0.6) min < med < max: -1 < 2 < 2 IQR (CV) : 1 (0.4)
-1:72(1.9%)
1:1412(36.5%)
2:2383(61.6%)
3867 (57.4%) 2874 (42.6%)
19 mother_manag [integer] Mean (sd) : 1.8 (0.5) min < med < max: -1 < 2 < 2 IQR (CV) : 0 (0.3)
-1:25(1.1%)
1:391(17.6%)
2:1810(81.3%)
2226 (33.0%) 4515 (67.0%)
20 tenancy [integer] Mean (sd) : 1.5 (0.7) min < med < max: -1 < 1 < 3 IQR (CV) : 1 (0.5)
-1:7(0.2%)
1:2396(58.1%)
2:1205(29.2%)
3:513(12.4%)
4121 (61.1%) 2620 (38.9%)
21 monthly_income [integer] Mean (sd) : 624.1 (280.8) min < med < max: 1 < 851 < 851 IQR (CV) : 495 (0.5) 651 distinct values 6741 (100.0%) 0 (0.0%)
22 age [numeric] Mean (sd) : 44.1 (9.2) min < med < max: 27 < 45 < 59 IQR (CV) : 16 (0.2) 33 distinct values 6741 (100.0%) 0 (0.0%)
23 log_income [numeric] Mean (sd) : 10.9 (1.1) min < med < max: 0 < 11 < 13.4 IQR (CV) : 0.9 (0.1) 4077 distinct values 6741 (100.0%) 0 (0.0%)
24 citizenship [factor] 1. 1 2. 2 3. 3
5781(85.8%)
352(5.2%)
608(9.0%)
6741 (100.0%) 0 (0.0%)

Generated by summarytools 0.9.8 (R version 4.0.3)
2021-02-17

Summary Statistics All Countries

## Joining, by = c("Country", "Sample Size", "Avg. Equ.Income", "Std. dev.", "Gini")
Country Statistics
Country Sample Size Avg. Equ.Income Std. dev. Gini
AT 6741 72909.75 51328.24 0.3620218
FR 11013 59673.42 48605.76 0.3540666
DK 4536 82525.95 48369.46 0.3055454
ES 17160 42864.46 32836.60 0.3851535
FI 8342 62540.79 39466.20 0.3236630
LV 7288 15121.82 11704.88 0.4041037

3.2 Empirical Analysis

3.2.1 Regression Trees

Following (Brunori, Hufe, and Mahler 2018) we split the data into training and test sets in a \(2/3:1/3\) ratio. Furthermore, we chose to show the results obtained using regression trees from the ‘rpart’ package. The training and test sets will also be reused in the subsequent analysis when we proceed with ‘ctree’.

set.seed(123)

AT_equality_data <- AT_equality_data %>%
  mutate(train_index = sample(c("train", "test"), nrow(AT_equality_data), replace=TRUE, prob=c(0.67, 0.33)))

AT_train <- AT_equality_data %>% filter(train_index=="train")
AT_test <- AT_equality_data %>% filter(train_index=="test")

formula <- log_income ~ sex + parents_present + adults_home + children_home + father_cob + father_cit + mother_cob + mother_cit + father_edu + mother_edu + father_occup_stat + mother_occup_stat + father_occup + mother_occup + father_manag + mother_manag + tenancy

# AT_tree <- rpart(formula, data = AT_train, cp=.008)

# AT_tree

# rpart.plot(AT_tree, box.palette="RdBu", nn=FALSE, type=2, main = "Regression Tree for Austria 2011")
FI_equality_data <- FI_equality_data %>%
  mutate(train_index = sample(c("train", "test"), nrow(FI_equality_data), replace=TRUE, prob=c(0.67, 0.33)))

FI_train <- FI_equality_data %>% filter(train_index=="train")
FI_test <- FI_equality_data %>% filter(train_index=="test")


#FI_tree <- rpart(formula, data = FI_train, cp=.003)

#FI_tree

#rpart.plot(FI_tree, box.palette="RdBu", nn=FALSE, type=2, main = "Regression Tree for Finland 2011")
LV_equality_data <- LV_equality_data %>%
  mutate(train_index = sample(c("train", "test"), nrow(LV_equality_data), replace=TRUE, prob=c(0.67, 0.33)))

LV_train <- LV_equality_data %>% filter(train_index=="train")
LV_test <- LV_equality_data %>% filter(train_index=="test")


#LV_tree <- rpart(formula, data = LV_train, cp=.003)

#LV_tree

#rpart.plot(LV_tree, box.palette="RdBu", nn=FALSE, type=2, main = "Regression Tree for Latvia 2011")

3.2.2 Conditional Inference Trees

AT_Ctree <- ctree(formula, data = AT_train, alpha = 0.05, maxdepth = 5)
AT_Ctree
## 
## Model formula:
## log_income ~ sex + parents_present + adults_home + children_home + 
##     father_cob + father_cit + mother_cob + mother_cit + father_edu + 
##     mother_edu + father_occup_stat + mother_occup_stat + father_occup + 
##     mother_occup + father_manag + mother_manag + tenancy
## 
## Fitted party:
## [1] root
## |   [2] mother_cob <= 1
## |   |   [3] sex <= 1
## |   |   |   [4] father_edu <= 1: 10.947 (n = 669, err = 527.8)
## |   |   |   [5] father_edu > 1
## |   |   |   |   [6] mother_cit <= 2: 11.069 (n = 1000, err = 699.4)
## |   |   |   |   [7] mother_cit > 2: 10.940 (n = 110, err = 53.8)
## |   |   [8] sex > 1
## |   |   |   [9] mother_cit <= 2
## |   |   |   |   [10] father_cit <= 2: 10.875 (n = 1251, err = 1683.8)
## |   |   |   |   [11] father_cit > 2: 10.643 (n = 232, err = 519.4)
## |   |   |   [12] mother_cit > 2: 10.682 (n = 245, err = 429.6)
## |   [13] mother_cob > 1
## |   |   [14] sex <= 1
## |   |   |   [15] children_home <= 5
## |   |   |   |   [16] mother_cob <= 2: 10.945 (n = 195, err = 167.1)
## |   |   |   |   [17] mother_cob > 2: 10.788 (n = 260, err = 223.9)
## |   |   |   [18] children_home > 5: 10.037 (n = 18, err = 115.6)
## |   |   [19] sex > 1: 10.538 (n = 556, err = 1444.8)
## 
## Number of inner nodes:     9
## Number of terminal nodes: 10
plot(AT_Ctree, type = "simple", gp = gpar(fontsize = 6),
  inner_panel=node_inner,
  ip_args=list(abbreviate = FALSE,id = FALSE), main = "Conditional Inference Tree for Austria 2011")

3.2.3 Cross Validation using the Caret package

fitControl <- trainControl(method = "repeatedcv", number = 10, repeats = 10, savePredictions = T)
AT_cctree1 <- train(formula, data = AT_train, method = "ctree", trControl = fitControl, na.action = na.pass)

AT_cctree1 #This is the suggested tree we get from applying Caret
## Conditional Inference Tree 
## 
## 4536 samples
##   17 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 10 times) 
## Summary of sample sizes: 4082, 4083, 4082, 4082, 4083, 4082, ... 
## Resampling results across tuning parameters:
## 
##   mincriterion  RMSE      Rsquared     MAE      
##   0.01          1.165410  0.007820882  0.7108191
##   0.50          1.143866  0.008421223  0.6876553
##   0.99          1.138295  0.010786909  0.6829441
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was mincriterion = 0.99.
AT_cct <- ctree(formula, data = AT_train, mincriterion = 0.99) #Using the suggestion we generate a Conditional Inference Tree and plot it as our final result

plot(AT_cct,gp = gpar(fontsize = 8),
  inner_panel=node_inner,
  ip_args=list(abbreviate = FALSE,id = FALSE), main = "Opportunity Conditional Inference Tree for Austria 2011 - Cross Validated with Caret")

AT_ctree2 <- ctree(formula, data = AT_train, control = ctree_control(testtype = "Bonferroni", mincriterion = 0.99)) 
AT_ctree2
## 
## Model formula:
## log_income ~ sex + parents_present + adults_home + children_home + 
##     father_cob + father_cit + mother_cob + mother_cit + father_edu + 
##     mother_edu + father_occup_stat + mother_occup_stat + father_occup + 
##     mother_occup + father_manag + mother_manag + tenancy
## 
## Fitted party:
## [1] root
## |   [2] mother_cob <= 1
## |   |   [3] mother_cit <= 2
## |   |   |   [4] sex <= 1
## |   |   |   |   [5] father_edu <= 1: 10.960 (n = 605, err = 530.3)
## |   |   |   |   [6] father_edu > 1: 11.052 (n = 993, err = 724.7)
## |   |   |   [7] sex > 1: 10.837 (n = 1508, err = 2049.6)
## |   |   [8] mother_cit > 2: 10.776 (n = 397, err = 560.8)
## |   [9] mother_cob > 1
## |   |   [10] sex <= 1: 10.869 (n = 470, err = 458.2)
## |   |   [11] sex > 1: 10.537 (n = 563, err = 1577.5)
## 
## Number of inner nodes:    5
## Number of terminal nodes: 6
plot(AT_ctree2, type = "simple",gp = gpar(fontsize = 6),
  inner_panel=node_inner,
  ip_args=list(abbreviate = FALSE,id = FALSE), main = "Opportunity Conditional Inference Tree for Austria 2011 - Cross Validated with Ctree")

AT_test$P_AtCt <- predict(AT_ctree2, newdata = as.data.frame(AT_test))

AT_test$perror <- (AT_test$P_AtCt - AT_test$log_income)^2

AT_test$RMSE <- sqrt(sum((AT_test$P_AtCt - AT_test$log_income)^2/nrow(AT_test), na.rm = T))

head(AT_test$RMSE)
## [1] 1.023578 1.023578 1.023578 1.023578 1.023578 1.023578
# For Austria we have an RMSE of 1.02, which is not very good, but this is most likely attributable to the synthetic data.

# Plot the Errors somehow
FR_Ctree <- ctree(formula, data = FR_train)
FR_Ctree
## 
## Model formula:
## log_income ~ sex + parents_present + adults_home + children_home + 
##     father_cob + father_cit + mother_cob + mother_cit + father_edu + 
##     mother_edu + father_occup_stat + mother_occup_stat + father_occup + 
##     mother_occup + father_manag + mother_manag + tenancy
## 
## Fitted party:
## [1] root
## |   [2] mother_cit <= 3
## |   |   [3] sex <= 1
## |   |   |   [4] father_edu <= 1: 10.769 (n = 2665, err = 1739.3)
## |   |   |   [5] father_edu > 1: 10.847 (n = 725, err = 456.2)
## |   |   [6] sex > 1
## |   |   |   [7] father_cob <= 3: 10.704 (n = 3114, err = 2827.5)
## |   |   |   [8] father_cob > 3: 10.525 (n = 427, err = 534.5)
## |   [9] mother_cit > 3
## |   |   [10] mother_cob <= 3: 10.596 (n = 401, err = 377.3)
## |   |   [11] mother_cob > 3: 10.314 (n = 53, err = 75.9)
## 
## Number of inner nodes:    5
## Number of terminal nodes: 6
plot(FR_Ctree, type = "simple",gp = gpar(fontsize = 6),
  inner_panel=node_inner,
  ip_args=list(abbreviate = FALSE,id = FALSE), main = "Conditional Inference Tree for France 2011")

FR_cctree <- train(formula, data = FR_train, method = "ctree", trControl = fitControl, na.action = na.pass)

FR_cctree #This is the suggested tree we get from applying Caret
## Conditional Inference Tree 
## 
## 7385 samples
##   17 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 10 times) 
## Summary of sample sizes: 6647, 6647, 6646, 6646, 6647, 6646, ... 
## Resampling results across tuning parameters:
## 
##   mincriterion  RMSE       Rsquared     MAE      
##   0.01          0.9076050  0.008670050  0.5856330
##   0.50          0.9003582  0.008738263  0.5761678
##   0.99          0.9003106  0.006501001  0.5752667
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was mincriterion = 0.99.
FR_cct <- ctree(formula, data = FR_train, mincriterion = 0.99) #Using the suggestion we generate a Conditional Inference Tree and plot it as our final result

plot(FR_cct, type = "simple",gp = gpar(fontsize = 8),
  inner_panel=node_inner,
  ip_args=list(abbreviate = FALSE,id = FALSE), main = "Opportunity Conditional Inference Tree for France 2011 - Cross Validated")

FR_test$P_FRCt <- predict(FR_cct, newdata = as.data.frame(FR_test))

FR_test$perror <- (FR_test$P_FRCt - FR_test$log_income)^2

FR_test$RMSE <- sqrt(sum((FR_test$P_FRCt - FR_test$log_income)^2/nrow(FR_test), na.rm = T))

head(FR_test$RMSE)
## [1] 0.8556304 0.8556304 0.8556304 0.8556304 0.8556304 0.8556304
# RMSE 0.86
# ES_Ctree <- ctree(formula, data = ES_train)
# ES_Ctree
# 
# plot(ES_Ctree, type = "simple",gp = gpar(fontsize = 6),
#   inner_panel=node_inner,
#   ip_args=list(abbreviate = FALSE,id = FALSE), main = "Conditional Inference Tree for Spain 2011")
# 
# ES_cctree <- train(formula, data = ES_train, method = "ctree", trControl = fitControl, na.action = na.omit) # The Spanish synthetic dataset has many NAs; the output of the tree is unreliable as we don't have information on the errors
# 
# ES_cctree #This is the suggested tree we get from applying Caret
# 
# 
# ES_cct <- ctree(formula, data = ES_train, mincriterion = 0.99) #Using the suggestion we generate a Conditional Inference Tree and plot it as our final result
# 
# plot(ES_cct, type = "simple",gp = gpar(fontsize = 8),
#   inner_panel=node_inner,
#   ip_args=list(abbreviate = FALSE,id = FALSE), main = "Opportunity Conditional Inference Tree for Spain 2011 - Cross Validated")
# IT_Ctree <- ctree(formula, data = IT_train, control = ctree_control())
# IT_Ctree
# 
# plot(IT_Ctree,gp = gpar(fontsize = 6),
#   inner_panel=node_inner,
#   ip_args=list(abbreviate = FALSE,id = FALSE), main = "Conditional Inference Tree for Italy 2011")
# 
# 
# IT_cctree <- train(formula, data = IT_train, method = "ctree", trControl = fitControl, na.action = na.pass)
# 
# IT_cctree #suggests using mincriterion 0.99
# plot(IT_cctree$finalModel)
# 
# #plotted as ctree
# IT_cct <- ctree(formula, data = IT_train, mincriterion = 0.99)
# 
# plot(IT_cct,gp = gpar(fontsize = 8),
#   inner_panel=node_inner,
#   ip_args=list(abbreviate = FALSE,id = FALSE), main = "Opportunity Conditional Inference Tree for Italy 2011 - Cross Validated")


# In Italy we have too many NAs among the circumstance variables
# IT_test$P_Ct <- predict(IT_cct, newdata = as.data.frame(IT_test))
# 
# IT_test$perror <- (IT_test$P_Ct - IT_test$log_income)^2
# 
# IT_test$RMSE <- sqrt(sum((IT_test$P_Ct - IT_test$log_income)^2/nrow(IT_test), na.rm = T))
# The Denmark set has too many missing values, we cannot evaluate it with the given variables

# DK_cctree <- train(formula, data = DK_train, method = "ctree", trControl = fitControl, na.action = na.omit)
# 
# DK_cctree
# The Finland set has too many missing values
# FI_cctree <- train(formula, data = FI_train, method = "ctree", trControl = fitControl, na.action = na.omit)
# 
# FI_cctree
LV_Ctree <- ctree(formula, data = LV_train)
LV_Ctree
## 
## Model formula:
## log_income ~ sex + parents_present + adults_home + children_home + 
##     father_cob + father_cit + mother_cob + mother_cit + father_edu + 
##     mother_edu + father_occup_stat + mother_occup_stat + father_occup + 
##     mother_occup + father_manag + mother_manag + tenancy
## 
## Fitted party:
## [1] root
## |   [2] mother_edu <= 2
## |   |   [3] mother_occup <= 6
## |   |   |   [4] sex <= 1: 9.271 (n = 1325, err = 2070.4)
## |   |   |   [5] sex > 1: 9.126 (n = 1398, err = 2714.1)
## |   |   [6] mother_occup > 6: 9.084 (n = 1435, err = 3372.1)
## |   [7] mother_edu > 2
## |   |   [8] sex <= 1: 9.409 (n = 359, err = 439.3)
## |   |   [9] sex > 1: 9.220 (n = 364, err = 608.0)
## 
## Number of inner nodes:    4
## Number of terminal nodes: 5
plot(LV_Ctree, type = "simple", gp = gpar(fontsize = 8),
  inner_panel=node_inner,
  ip_args=list(abbreviate = FALSE,id = FALSE), main = "Conditional Inference Tree for Latvia 2011")

LV_cctree <- train(formula, data = LV_train, method = "ctree", trControl = fitControl, na.action = na.pass)
LV_cctree # again we choose mincriterion = 0.99 based on the cross-validated RMSE
## Conditional Inference Tree 
## 
## 4881 samples
##   17 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 10 times) 
## Summary of sample sizes: 4393, 4393, 4394, 4393, 4393, 4392, ... 
## Resampling results across tuning parameters:
## 
##   mincriterion  RMSE      Rsquared     MAE      
##   0.01          1.386232  0.005658932  0.8467050
##   0.50          1.369181  0.007020794  0.8282032
##   0.99          1.368459  0.003908415  0.8250767
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was mincriterion = 0.99.
plot(LV_cctree$finalModel)

# refit the tree with ctree_control, fixing the cross-validated mincriterion = 0.99 (Bonferroni-adjusted tests)
LV_cct <- ctree(formula, data = LV_train, control = ctree_control(testtype = "Bonferroni", mincriterion = 0.99))


plot(LV_cct,gp = gpar(fontsize = 8),
  inner_panel=node_inner,
  ip_args=list(abbreviate = FALSE,id = FALSE), main = "Conditional Inference Tree for Latvia 2011 - Cross Validated")

LV_test$P_Ct <- predict(LV_cct, newdata = as.data.frame(LV_test))

LV_test$perror <- (LV_test$P_Ct - LV_test$log_income)^2

# test-set RMSE (the denominator is the full test-set size; NA errors are dropped from the sum)
LV_test$RMSE <- sqrt(sum(LV_test$perror, na.rm = TRUE) / nrow(LV_test))

# The test-set RMSE of about 1.4 is close to the cross-validated estimate and points to weak predictive power of the model
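To put an RMSE of roughly 1.4 in context, one can compare it against a naive baseline that predicts the mean log income for every respondent; that baseline's RMSE is simply the standard deviation of the outcome. A minimal sketch, assuming `LV_test` with its `P_Ct` and `log_income` columns is in scope (`baseline_rmse` and `tree_rmse` are illustrative names, not part of the original pipeline):

```r
# Naive baseline: predict the test-set mean log income for everyone
baseline_pred <- mean(LV_test$log_income, na.rm = TRUE)
baseline_rmse <- sqrt(mean((LV_test$log_income - baseline_pred)^2, na.rm = TRUE))

# Relative improvement of the tree over the baseline (positive = tree helps)
tree_rmse <- sqrt(mean((LV_test$P_Ct - LV_test$log_income)^2, na.rm = TRUE))
1 - tree_rmse / baseline_rmse
```

A ratio close to zero would mirror the very small cross-validated R-squared reported above.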

3.2.4 Conditional Inference Forest

AT_cf <- cforest(formula, AT_equality_data, na.action = na.pass,
  control = ctree_control(teststat = "quadratic", testtype = "Bonferroni", mincriterion = 0.99),
  ytrafo = NULL, scores = NULL, ntree = 500L,
  perturb = list(replace = FALSE, fraction = 0.8))

AThat_cf <- predict(AT_cf, newdata = AT_test, OOB = TRUE, type = "response")

varimp(AT_cf, mincriterion = 0, OOB = TRUE) 
##               sex   parents_present       adults_home     children_home 
##      1.762207e-02      6.569783e-04     -2.823649e-04     -8.881077e-05 
##        father_cob        father_cit        mother_cob        mother_cit 
##      2.569883e-03      2.651544e-03      8.431948e-03      4.440571e-03 
##        father_edu        mother_edu father_occup_stat      father_occup 
##      1.137809e-03     -1.566985e-04      1.384726e-03      7.708348e-05 
##      mother_occup      father_manag 
##     -1.021288e-02     -2.142929e-03
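Note that `varimp` computes permutation importance by randomly permuting each predictor out-of-bag, so repeated calls return slightly different values; the printed vector and the data frame built below therefore need not agree exactly. A sketch that stabilizes the estimate by fixing the seed and averaging over repetitions, assuming the fitted forest `AT_cf` is in scope:

```r
set.seed(123)  # make the random permutations reproducible

# Average the permutation importance over several draws to reduce Monte Carlo noise
imp_reps <- replicate(5, varimp(AT_cf, mincriterion = 0, OOB = TRUE))
imp_avg <- rowMeans(imp_reps)
sort(imp_avg, decreasing = TRUE)
```

Depending on the installed partykit version, `varimp` may also accept an `nperm` argument that performs this averaging internally.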
importance_cf <- data.frame(varimp(AT_cf, mincriterion = 0, OOB = TRUE))
names(importance_cf) <- "importance"
importance_cf$var_name = rownames(importance_cf)
importance_cf <- importance_cf  %>% 
  arrange( desc(importance))  %>%
  mutate(Country = "AT")
varimpo2 <- ggplot(importance_cf, aes(x = var_name, y = importance)) +
  geom_pointrange(shape = 21, colour = "black", fill = "white", size = 3, stroke = 1, aes(ymin = 0, ymax = importance)) +
    scale_x_discrete(limits = importance_cf$var_name[order(importance_cf$importance)]) +
    labs(title = "Conditional Forest variable importance - Austria 2011", x = "", y = "Mean decrease in sum of squared residuals") +
    coord_flip() +
    theme_light() +
    theme(axis.line = element_blank(), panel.border = element_blank(), panel.grid.major.y=element_blank())

ggplotly(varimpo2)
FR_cf <- cforest(formula, FR_equality_data, na.action = na.pass,
  control = ctree_control(teststat = "quadratic", testtype = "Bonferroni", mincriterion = 0.99),
  ytrafo = NULL, scores = NULL, ntree = 500L,
  perturb = list(replace = FALSE, fraction = 0.8))
importance_cf_FR <- data.frame(varimp(FR_cf, mincriterion = 0, OOB = TRUE))
names(importance_cf_FR) <- "importance"
importance_cf_FR$var_name = rownames(importance_cf_FR)
importance_cf_FR <- importance_cf_FR  %>% arrange(desc(importance)) %>% mutate(Country = "FR")
# FI_cf <- cforest(formula, FI_equality_data, na.action = na.pass, control = ctree_control(teststat = "quadratic", testtype = "Bonferroni", mincriterion = 0.99), ytrafo = NULL, scores = NULL, ntree = 200L, perturb = list(replace = FALSE, fraction = 0.8))
# 
# importance_cf_FI <- data.frame(varimp(FI_cf, mincriterion = 0, OOB = TRUE))
# names(importance_cf_FI) <- "importance"
# importance_cf_FI$var_name = rownames(importance_cf_FI)
# importance_cf_FI <- importance_cf_FI  %>% arrange(desc(importance)) %>% mutate(Country = "FI")
LV_cf <- cforest(formula, LV_equality_data, na.action = na.pass,
  control = ctree_control(teststat = "quadratic", testtype = "Bonferroni", mincriterion = 0.99),
  ytrafo = NULL, scores = NULL, ntree = 500L,
  perturb = list(replace = FALSE, fraction = 0.8))

importance_cf_LV <- data.frame(varimp(LV_cf, mincriterion = 0, OOB = TRUE))
names(importance_cf_LV) <- "importance"
importance_cf_LV$var_name = rownames(importance_cf_LV)
importance_cf_LV <- importance_cf_LV  %>% arrange(desc(importance)) %>% mutate(Country = "LV")
Variable Importance: Country Comparison
df <- full_join(importance_cf, importance_cf_FR)
## Joining, by = c("importance", "var_name", "Country")
#df <- full_join(df, importance_cf_IT)
df <- full_join(df, importance_cf_LV) %>% group_by(Country)
## Joining, by = c("importance", "var_name", "Country")
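Raw permutation importances are on the scale of each country's squared residuals, so their magnitudes are not directly comparable across countries. A hedged sketch that converts them to within-country shares before plotting, assuming `df` from above and an attached dplyr; `importance_share` is an illustrative name, not part of the original pipeline:

```r
# Within-country importance shares: each country's non-negative importances sum to 1.
# Small negative permutation importances are clamped to zero first; this breaks down
# only in the degenerate case where no variable has positive importance.
df_share <- df %>%
  group_by(Country) %>%
  mutate(importance_share = pmax(importance, 0) / sum(pmax(importance, 0)))
```

Plotting `importance_share` instead of `importance` makes the cross-country ranking comparison less sensitive to differences in the outcome variance.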
dfvarimp <- ggplot(df, aes(x = var_name , y = importance, shape = Country, color=Country)) +
    geom_point() +
    scale_x_discrete(limits = importance_cf$var_name[order(importance_cf$importance)]) +
    labs(title = "Conditional Forest variable importance - Country Comparison", x = "", y = "Variable importance") +theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))

ggplotly(dfvarimp)
varimpo3 <- ggplot(importance_cf_FR, aes(x = var_name, y = importance)) +
  geom_pointrange(shape = 21, colour = "black", fill = "white", size = 3, stroke = 1, aes(ymin = 0, ymax = importance)) +
    scale_x_discrete(limits = importance_cf_FR$var_name[order(importance_cf_FR$importance)]) +
    labs(title = "Conditional Forest variable importance - France 2011", x = "", y = "Mean decrease in sum of squared residuals") +
    coord_flip() +
    theme_light() +
    theme(axis.line = element_blank(), panel.border = element_blank(), panel.grid.major.y=element_blank())

ggplotly(varimpo3)
varimpo4 <- ggplot(importance_cf_LV, aes(x = var_name, y = importance)) +
  geom_pointrange(shape = 21, colour = "black", fill = "white", size = 3, stroke = 1, aes(ymin = 0, ymax = importance)) +
    scale_x_discrete(limits = importance_cf_LV$var_name[order(importance_cf_LV$importance)]) +
    labs(title = "Conditional Forest variable importance - Latvia 2011", x = "", y = "Mean decrease in sum of squared residuals") +
    coord_flip() +
    theme_light() +
    theme(axis.line = element_blank(), panel.border = element_blank(), panel.grid.major.y=element_blank())

ggplotly(varimpo4)

4 Conclusion

References

Brunori, Paolo, Paul Hufe, and Daniel Gerszon Mahler. 2018. “The Roots of Inequality: Estimating Inequality of Opportunity from Regression Trees.” Ifo Working Paper Series 252. ifo Institute - Leibniz Institute for Economic Research at the University of Munich. https://ideas.repec.org/p/ces/ifowps/_252.html.

Brunori, Paolo, and Guido Neidhoefer. 2020. “The Evolution of Inequality of Opportunity in Germany: A Machine Learning Approach.” SERIES 01-2020. Dipartimento di Economia e Finanza - Università degli Studi di Bari "Aldo Moro". http://www.seriesworkingpapers.it/RePEc/bai/series/SERIES_WP_01-2020.pdf.

Hothorn, Torsten, Achim Zeileis, and Kurt Hornik. n.d. “Ctree: Conditional Inference Trees: R Vignette.” https://cran.r-project.org/web/packages/partykit/vignettes/ctree.pdf.